VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution
Most existing video face super-resolution (VFSR) methods are trained
and evaluated on VoxCeleb1, a dataset designed specifically for speaker
identification whose frames are of low quality. As a consequence, VFSR
models trained on this dataset cannot produce visually pleasing results.
In this paper, we develop an automatic and scalable pipeline to collect a
high-quality video face dataset (VFHQ), which contains a large number of
high-fidelity clips of diverse interview scenarios. To verify the
necessity of VFHQ, we further conduct experiments and demonstrate that VFSR
models trained on our VFHQ dataset can generate results with sharper edges and
finer textures than those trained on VoxCeleb1. In addition, we show that
temporal information plays a pivotal role in eliminating video consistency
issues and further improving visual quality. Based on VFHQ, we further
conduct a benchmarking study of several state-of-the-art algorithms under
bicubic and blind settings. See our project page:
https://liangbinxie.github.io/projects/vfhq
Review of 2D Animation Restoration in Visual Domain Based on Deep Learning
Traditional 2D animation is a distinct visual style whose production process and image characteristics differ significantly from real-life scenes. It is usually drawn frame by frame and stored as bitmaps. During storage, transmission, and playback, 2D animation may suffer from problems such as degraded picture quality, insufficient resolution, and temporal discontinuity. With the development of deep learning, the technology has been widely applied to animation restoration. This paper provides a comprehensive summary of deep-learning-based 2D animation restoration. First, we explore existing animation datasets to identify the data support currently available and the bottlenecks in building animation datasets. Second, we investigate and test deep-learning-based algorithms for animation image quality restoration and animation frame interpolation to identify the key issues and challenges in animation restoration. In addition, we introduce methods designed to ensure consistency between animation frames, which can provide insights for future animation video restoration. We also analyze the effectiveness of existing image quality assessment (IQA) methods on animation images to identify practical IQA methods that can guide restoration results. Finally, based on the above analysis, this paper clarifies the challenges in animation restoration tasks and outlines future directions for deep learning in this field.
Rethinking Alignment in Video Super-Resolution Transformers
The alignment of adjacent frames is considered an essential operation in
video super-resolution (VSR). Advanced VSR models, including the latest VSR
Transformers, are generally equipped with well-designed alignment modules.
However, advances in the self-attention mechanism may challenge this
conventional wisdom. In this paper, we rethink the role of alignment in VSR Transformers and
make several counter-intuitive observations. Our experiments show that: (i) VSR
Transformers can directly utilize multi-frame information from unaligned
videos, and (ii) existing alignment methods are sometimes harmful to VSR
Transformers. These observations indicate that we can further improve the
performance of VSR Transformers simply by removing the alignment module and
adopting a larger attention window. Nevertheless, such designs will
dramatically increase the computational burden, and cannot deal with large
motions. Therefore, we propose a new and efficient alignment method called
patch alignment, which aligns image patches instead of pixels. VSR Transformers
equipped with patch alignment achieve state-of-the-art performance on
multiple benchmarks. Our work provides valuable insights on how multi-frame
information is used in VSR and how to select alignment methods for different
networks/datasets. Codes and models will be released at
https://github.com/XPixelGroup/RethinkVSRAlignment.
Comment: Accepted at NeurIPS 2022.
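
As a rough illustration of the patch-alignment idea (one representative motion vector per patch instead of per-pixel warping), the sketch below shifts each patch of a neighbouring frame by the mean optical flow inside that patch. The mean-flow choice, the (dy, dx) flow layout, and frame sizes divisible by the patch size are assumptions for illustration, not the released implementation.

    # Simplified patch alignment: each patch of the supporting frame is moved
    # rigidly by one motion vector, which keeps its local structure intact.
    import numpy as np

    def patch_align(neighbor, flow, patch=8):
        """neighbor: (H, W, C) supporting frame; flow: (H, W, 2) flow towards
        the reference frame, stored as (dy, dx). Assumes H and W are multiples
        of the patch size."""
        H, W, _ = neighbor.shape
        aligned = np.zeros_like(neighbor)
        for y in range(0, H, patch):
            for x in range(0, W, patch):
                # One representative vector per patch (mean flow, rounded).
                dy, dx = flow[y:y + patch, x:x + patch].reshape(-1, 2).mean(0).round().astype(int)
                ys = int(np.clip(y + dy, 0, H - patch))
                xs = int(np.clip(x + dx, 0, W - patch))
                aligned[y:y + patch, x:x + patch] = neighbor[ys:ys + patch, xs:xs + patch]
        return aligned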
Enhanced Quadratic Video Interpolation
With the growth of the digital video industry, video frame interpolation has
attracted continuous attention in the computer vision community and seen a
surge of interest in industry. Many learning-based methods have been proposed
and have achieved promising results. Among them, a recent algorithm named quadratic
video interpolation (QVI) achieves appealing performance. It exploits
higher-order motion information (e.g. acceleration) and successfully models the
estimation of the interpolated flow. However, the intermediate frames it
produces still contain unsatisfactory ghosting, artifacts, and inaccurate motion,
especially when large and complex motion occurs. In this work, we further
improve the performance of QVI from three facets and propose an enhanced
quadratic video interpolation (EQVI) model. In particular, we adopt a rectified
quadratic flow prediction (RQFP) formulation with a least-squares method to
estimate the motion more accurately. Complementary to pixel-level image
blending, we introduce a residual contextual synthesis network (RCSN) to employ
contextual information in high-dimensional feature space, which could help the
model handle more complicated scenes and motion patterns. Moreover, to further
boost the performance, we devise a novel multi-scale fusion network (MS-Fusion)
which can be regarded as a learnable augmentation process. The proposed EQVI
model won the first place in the AIM2020 Video Temporal Super-Resolution
Challenge.
Comment: Winning solution of the AIM2020 VTSR Challenge (in conjunction with ECCV 2020).
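
The quadratic motion model behind QVI/RQFP expresses the flow from the reference frame towards time t as f(t) = v*t + 0.5*a*t^2, with per-pixel velocity v and acceleration a. The sketch below fits v and a by least squares from flows estimated towards several neighbouring frames; the function names, timestamps, and numpy-based formulation are illustrative assumptions, not the authors' implementation.

    # Minimal per-pixel least-squares fit of the quadratic flow model
    #     f(t) = v * t + 0.5 * a * t**2
    # used by QVI/EQVI-style interpolation (a sketch, not the paper's code).
    import numpy as np

    def fit_quadratic_flow(flows, times):
        """flows: list of (H, W, 2) flow maps from frame 0 to each timestamp.
        times: matching relative timestamps, e.g. [-1.0, 1.0, 2.0].
        Returns per-pixel velocity v and acceleration a, each (H, W, 2)."""
        t = np.asarray(times, dtype=np.float64)        # (N,)
        A = np.stack([t, 0.5 * t ** 2], axis=1)        # (N, 2) design matrix
        B = np.stack([f.reshape(-1) for f in flows])   # (N, H*W*2) observations
        sol, *_ = np.linalg.lstsq(A, B, rcond=None)    # (2, H*W*2)
        v, a = sol.reshape(2, *flows[0].shape)
        return v, a

    def interpolated_flow(v, a, t):
        """Flow from frame 0 towards an arbitrary intermediate time t."""
        return v * t + 0.5 * a * t ** 2

With only two timestamps [-1, 1] the fit reduces to the closed form of the original QVI, v = (f(1) - f(-1)) / 2 and a = f(1) + f(-1); the least-squares form simply generalises this to more observed flows.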
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
The incredible generative ability of large-scale text-to-image (T2I) models
has demonstrated strong power of learning complex structures and meaningful
semantics. However, relying solely on text prompts cannot fully take advantage
of the knowledge learned by the model, especially when flexible and accurate
structure control is needed. In this paper, we aim to "dig out" the
capabilities that T2I models have implicitly learned, and then explicitly use
them to control generation at a finer granularity. Specifically, we propose to
learn simple and small T2I-Adapters to align internal knowledge in T2I models
with external control signals, while freezing the original large T2I models. In
this way, we can train various adapters according to different conditions, and
achieve rich control and editing effects. Further, the proposed T2I-Adapters
have attractive properties of practical value, such as composability and
generalization ability. Extensive experiments demonstrate that our T2I-Adapter
has promising generation quality and a wide range of applications.
Comment: Tech report. GitHub: https://github.com/TencentARC/T2I-Adapter
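
To make the adapter idea concrete, the sketch below shows a small convolutional adapter that maps a spatial condition (e.g. a sketch or depth map) into multi-scale features, which would be added to the frozen T2I U-Net's encoder features at the matching scales. Module names, channel widths, and the additive injection are assumptions for illustration, not the released TencentARC/T2I-Adapter code.

    # A tiny adapter: condition image in, one feature map per encoder scale out.
    # Only the adapter is trained; the large T2I model stays frozen.
    import torch.nn as nn

    class TinyAdapter(nn.Module):
        def __init__(self, cond_channels=3, widths=(64, 128, 256, 256)):
            super().__init__()
            blocks, in_ch = [], cond_channels
            for w in widths:
                blocks.append(nn.Sequential(
                    nn.Conv2d(in_ch, w, 3, stride=2, padding=1),  # one downsample per scale
                    nn.SiLU(),
                    nn.Conv2d(w, w, 3, padding=1),
                ))
                in_ch = w
            self.blocks = nn.ModuleList(blocks)

        def forward(self, cond):
            feats, x = [], cond
            for blk in self.blocks:
                x = blk(x)
                feats.append(x)       # aligned with one U-Net encoder stage each
            return feats

    # During denoising, each adapter feature would be added to the matching
    # frozen encoder feature: unet_feat[i] = unet_feat[i] + adapter_feats[i]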
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
This paper presents a LoRA-free method for stylized image generation that
takes a text prompt and style reference images as inputs and produces an output
image in a single pass. Unlike existing methods that rely on training a
separate LoRA for each style, our method can adapt to various styles with a
unified model. However, this poses two challenges: 1) the prompt loses
controllability over the generated content, and 2) the output image inherits
both the semantic and style features of the style reference image, compromising
its content fidelity. To address these challenges, we introduce StyleAdapter, a
model that comprises two components: a two-path cross-attention module (TPCA)
and three decoupling strategies. These components enable our model to process
the prompt and style reference features separately and reduce the strong
coupling between the semantic and style information in the style references.
StyleAdapter can generate high-quality images that match the content of the
prompts and adopt the style of the references (even for unseen styles) in a
single pass, which is more flexible and efficient than previous methods.
Experiments have been conducted to demonstrate the superiority of our method
over previous works.
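
A rough sketch of a two-path cross-attention block in the spirit of the TPCA module described above: image latents attend separately to prompt features and to style-reference features, and the two results are fused with a learnable weight. The class name, dimensions, and fusion scheme are illustrative assumptions rather than the paper's exact design.

    # Two-path cross-attention: one path for content (text), one for style.
    import torch
    import torch.nn as nn

    class TwoPathCrossAttention(nn.Module):
        def __init__(self, dim=320, heads=8):
            super().__init__()
            self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Parameter(torch.tensor(1.0))  # learnable style strength

        def forward(self, latents, text_feats, style_feats):
            # Path 1: content control from the text prompt.
            text_out, _ = self.text_attn(latents, text_feats, text_feats)
            # Path 2: appearance control from the style-reference features.
            style_out, _ = self.style_attn(latents, style_feats, style_feats)
            return latents + text_out + self.gate * style_out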
Mitigating Artifacts in Real-World Video Super-resolution Models
The recurrent structure is a prevalent framework for video super-resolution, modeling the temporal dependency between frames via hidden states. When applied to real-world scenarios with unknown and complex degradations, hidden states tend to contain unpleasant artifacts and propagate them to the restored frames. Our analyses show that such artifacts can be largely alleviated when the hidden state is replaced with a cleaner counterpart. Based on this observation, we propose a Hidden State Attention (HSA) module to mitigate artifacts in real-world video super-resolution. Specifically, we first apply various cheap filters to produce a hidden state pool; for example, Gaussian blur filters smooth artifacts, while sharpening filters enhance details. To aggregate from this pool a new hidden state that contains fewer artifacts, we devise a Selective Cross Attention (SCA) module, in which the attention between input features and each hidden state is calculated. Equipped with HSA, our proposed method, FastRealVSR, achieves a 2x speedup over Real-BasicVSR while obtaining better performance. Codes will be available at https://github.com/TencentARC/FastRealVSR
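
As an illustration of the hidden-state pool and selective aggregation described above, the sketch below builds candidate hidden states with cheap fixed filters and mixes them with per-pixel attention weights computed against the input feature. It is a simplified re-statement under assumed tensor shapes and similarity measures, not the FastRealVSR code.

    # Hidden-state pool from cheap filters + attention-weighted aggregation.
    import torch
    import torch.nn.functional as F

    def make_pool(hidden, kernels):
        """Apply each fixed (1, 1, k, k) filter depthwise to hidden (B, C, H, W)."""
        C = hidden.shape[1]
        pool = [hidden]                                  # keep the original state too
        for k in kernels:
            w = k.to(hidden).repeat(C, 1, 1, 1)          # depthwise kernel per channel
            pool.append(F.conv2d(hidden, w, padding=k.shape[-1] // 2, groups=C))
        return torch.stack(pool, dim=1)                  # (B, N, C, H, W)

    def selective_aggregate(feat, pool):
        """feat: (B, C, H, W) input feature; pool: (B, N, C, H, W) candidates."""
        logits = (feat.unsqueeze(1) * pool).sum(dim=2, keepdim=True)  # (B, N, 1, H, W)
        weights = logits.softmax(dim=1)                  # per-pixel mixing of candidates
        return (weights * pool).sum(dim=1)               # aggregated hidden state

    # Example cheap filters: a 3x3 box blur (smooths artifacts) and a
    # sharpening kernel (enhances details).
    blur = torch.full((1, 1, 3, 3), 1.0 / 9.0)
    sharpen = torch.tensor([[0., -1., 0.], [-1., 5., -1.], [0., -1., 0.]]).view(1, 1, 3, 3)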